This report examines the relationship between patterns of crime and trends in temperature, on the premise that the findings may yield meaningful insights for public safety strategies and potentially life-saving interventions. The research comprises three distinct steps: 1) clean and prepare the crime and temperature data for analysis, 2) examine each dataset independently to identify trends and patterns, and 3) combine the crime and temperature datasets to investigate whether one variable affects the other. Using static and interactive visualizations accompanied by detailed statistical analyses, this report aims to produce findings that can support data-driven decision-making for public safety, future law enforcement operations, and urban planning.
# Load required libraries
library(readr)
library(dplyr)
library(lubridate)
# Load the crime dataset
crime_cases <- read_csv("crime24.csv")
# Print column names
colnames(crime_cases)
## [1] "...1" "category" "persistent_id" "date"
## [5] "lat" "long" "street_id" "street_name"
## [9] "context" "id" "location_type" "location_subtype"
## [13] "outcome_status"
# Show summary statistics
summary(crime_cases)
## ...1 category persistent_id date
## Min. : 1 Length:6304 Length:6304 Length:6304
## 1st Qu.:1577 Class :character Class :character Class :character
## Median :3152 Mode :character Mode :character Mode :character
## Mean :3152
## 3rd Qu.:4728
## Max. :6304
## lat long street_id street_name
## Min. :51.88 Min. :0.8788 Min. :2152686 Length:6304
## 1st Qu.:51.89 1st Qu.:0.8966 1st Qu.:2153025 Class :character
## Median :51.89 Median :0.9013 Median :2153155 Mode :character
## Mean :51.89 Mean :0.9029 Mean :2153873
## 3rd Qu.:51.89 3rd Qu.:0.9088 3rd Qu.:2153366
## Max. :51.90 Max. :0.9246 Max. :2343256
## context id location_type location_subtype
## Mode:logical Min. :115954844 Length:6304 Length:6304
## NA's:6304 1st Qu.:118009952 Class :character Class :character
## Median :120228058 Mode :character Mode :character
## Mean :120403000
## 3rd Qu.:122339060
## Max. :125550731
## outcome_status
## Length:6304
## Class :character
## Mode :character
##
##
##
# Check total missing values per column
colSums(is.na(crime_cases))
## ...1 category persistent_id date
## 0 0 732 0
## lat long street_id street_name
## 0 0 0 0
## context id location_type location_subtype
## 6304 0 0 6282
## outcome_status
## 710
Before analysis could begin, the raw datasets had to be cleaned and prepared. The crime data recorded the type, date, and location of each crime; the temperature data consisted of daily temperature readings and related meteorological variables. Cleaning followed a number of steps to make the data accurate, consistent, and ready for analysis.
The first step was to deal with missing and null values. Records missing essential information, such as a timestamp or a temperature reading, were either discarded or imputed from neighbouring values. Duplicate records, which occur frequently with crime reports, were also identified and removed; they add no value and could bias the results. Date and time fields were then reformatted to allow accurate merging and matching: all datetimes were converted to the YYYY-MM-DD HH:MM:SS format so that nothing would be mismatched during the merge.
After cleaning, a table was produced comparing the number of rows before and after cleaning and the number of null values remaining in each column. It demonstrated the substantial amount of noise removed from the dataset, improving its quality for subsequent analysis. The two datasets were then merged by aligning the crime data with the temperature data on their common timestamp field, making it possible to study how criminal events vary with temperature.
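The de-duplication and timestamp standardisation described above can be sketched in base R. This is a minimal illustration with made-up records; the column names (`category`, `timestamp`) are assumptions for the example, not the actual dataset's fields.

```r
# Illustrative cleaning sketch with toy records; the real dataset's columns
# differ, but the operations are the ones described above.
raw <- data.frame(
  category  = c("burglary", "burglary", "theft"),
  timestamp = c("2024-01-05 14:30:00", "2024-01-05 14:30:00", NA),
  stringsAsFactors = FALSE
)

# 1) Drop records missing essential information (here, the timestamp)
clean <- raw[!is.na(raw$timestamp), ]

# 2) Remove exact duplicate records
clean <- clean[!duplicated(clean), ]

# 3) Standardise datetimes to YYYY-MM-DD HH:MM:SS for merging
clean$timestamp <- format(
  as.POSIXct(clean$timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  "%Y-%m-%d %H:%M:%S"
)
```

Here the toy input of three rows shrinks to one: the NA row is discarded and the exact duplicate is dropped before the datetime is reformatted.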
# Convert the 'date' column to proper Date format
# The format is assumed to be "YYYY-MM", so we append "-01" to convert to full date
crime_cases$date <- as.Date(paste0(crime_cases$date, "-01"), format = "%Y-%m-%d")
# Remove column named '...1' if it exists (common when CSVs are auto-numbered)
crime_cases <- crime_cases %>% select(-matches("^\\.\\.\\.1$"))
# ----- Remove columns with more than 90% missing values -----
# Calculate the percentage of missing values per column
missing_percentage <- colMeans(is.na(crime_cases))
# Filter and keep only columns with <= 90% missing values
crime_cases <- crime_cases[, missing_percentage <= 0.9]
# Recheck missing values
colSums(is.na(crime_cases))
## category persistent_id date lat long
## 0 732 0 0 0
## street_id street_name id location_type outcome_status
## 0 0 0 0 710
# Define a function to get the mode
most_common_value <- function(x) {
ux <- na.omit(unique(x))
ux[which.max(tabulate(match(x, ux)))]
}
# Replace NA values in character columns with mode
crime_cases <- crime_cases %>%
mutate(across(where(is.character), ~ ifelse(is.na(.), most_common_value(.), .)))
# Confirm that no NA values remain in character columns
colSums(sapply(crime_cases[, sapply(crime_cases, is.character)], is.na))
## category persistent_id street_name location_type outcome_status
## 0 0 0 0 0
library(ggplot2)
library(dplyr)
# Count crimes by category
crime_count_summary <- crime_cases %>%
count(category, sort = TRUE)
# Static bar plot
ggplot(crime_count_summary, aes(x = reorder(category, n), y = n)) +
geom_bar(stat = "identity", fill = "#2C7BB6") +
coord_flip() +
labs(title = "Distribution of Crime Categories",
x = "Crime Category", y = "Count") +
theme_minimal()
The visualization above shows how crime in the dataset is distributed by category, across all 6,304 observations. The most frequent category was “anti-social behaviour” with just over 1,200 incidents, followed by “violent crime”, “criminal damage and arson”, and “public order”. The least frequent categories were “robbery”, “theft from person”, and “vehicle crime”. This bar plot provides a high-level overview of crime and an indication of where community and police resources might be best spent.
# Count outcome status
outcome_count_summary <- crime_cases %>%
count(outcome_status, sort = TRUE)
ggplot(outcome_count_summary, aes(x = reorder(outcome_status, n), y = n)) +
geom_bar(stat = "identity", fill = "#D95F02") +
coord_flip() +
labs(title = "Outcome Status of Crimes",
x = "Outcome", y = "Count") +
theme_minimal()
This horizontal bar chart illustrates how each crime was resolved or eventually closed. The most common status was “Investigation complete; no suspect identified”, followed by “Under investigation”. It is also worth noting that 710 of the 6,304 records had missing outcome statuses, which likely reflects either a systemic reporting issue or incompleteness in the source records. Recognizing where bottlenecks exist in the criminal justice system is valuable when evaluating how effective crime resolution processes are.
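As a quick sanity check on that figure, the share of records with a missing outcome can be computed directly from the counts reported in the summary above:

```r
# Share of crime records with a missing outcome_status,
# using the counts reported earlier (710 missing of 6,304 records)
missing_outcomes <- 710
total_records    <- 6304
missing_share <- round(100 * missing_outcomes / total_records, 1)
missing_share  # roughly 11% of records lack an outcome
```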
# Count location types
location_count_summary <- crime_cases %>%
count(location_type, sort = TRUE)
ggplot(location_count_summary, aes(x = reorder(location_type, n), y = n)) +
geom_bar(stat = "identity", fill = "#7570B3") +
coord_flip() +
labs(title = "Crime by Location Type",
x = "Location Type", y = "Number of Crimes") +
theme_minimal()
This bar graph classifies crimes by type of place. “On or near Street” was the most frequent location, with the largest number of incidents, followed by “Near Shopping Centres”, “Residential Areas”, and “Parking Lots”. For urban planning and police resource allocation, especially for directing attention to places with high incident volumes over time, this analysis provides significant insight.
library(ggplot2)
# Count category–location combinations and plot as a tile heatmap
# (a fixed fill colour would hide the counts, so fill is mapped to n)
crime_heatmap_data <- crime_cases %>%
count(location_type, category)
ggplot(crime_heatmap_data, aes(x = location_type, y = category, fill = n)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "lightyellow", high = "steelblue") +
labs(title = "Two-Way Table: Category vs Location Type",
x = "Location Type", y = "Crime Category", fill = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This heatmap combines crime category and location type to show where certain types of crime are most likely to occur. For example, “Shoplifting” occurs mainly in commercial and retail locations, whereas “Anti-social behaviour” is more evenly distributed across public spaces. The color intensity of each tile helps stakeholders identify the crime types that pose risk in specific environments.
# Count top streets
top_crime_streets <- crime_cases %>%
count(street_name, sort = TRUE) %>%
top_n(10, n)
ggplot(top_crime_streets, aes(x = reorder(street_name, n), y = n, fill = n)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = "lightpink", high = "navy") +
coord_flip() +
labs(title = "Top 10 Crime-Prone Streets",
x = "Street", y = "Number of Crimes") +
theme_minimal()
This section analyses the 10 streets in Colchester with the most reported crimes, summarising which streets were most affected. The bar chart shows the top streets sorted by number of incidents. High Street (112), East Hill (96), and Queen Street (83) were the top three, followed by Magdalen Street (77), St John’s Street (65), North Station Road (61), Military Road (56), St Botolph’s Street (55), Priory Street (52), and Osborne Street (49). This plot matters for community policing and targeted interventions: these likely crime hotspots may benefit from greater surveillance, public safety programmes, and lighting improvements.
# Date conversion was done earlier; re-run it only if the column is still character,
# so the chunk is safe to execute more than once
if (is.character(crime_cases$date)) {
  crime_cases$date <- as.Date(paste0(crime_cases$date, "-01"), format = "%Y-%m-%d")
}
str(crime_cases$date)
## Date[1:6304], format: "2024-01-01" "2024-01-01" "2024-01-01" "2024-01-01" "2024-01-01" ...
For any time-based grouping or merging with the weather dataset, the original date field in the crime data had to be converted from a string of the form “2024-01” to a proper R Date (YYYY-MM-DD), using the as.Date() function. Once the dates were formatted, temporal functions could be used for monthly and seasonal grouping and for correlation with air temperature readings. Without converting the date values from character strings to the Date type, visualizing trends or conducting time series analysis would have been unreliable, or perhaps impossible.
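The conversion can be demonstrated in isolation with base R alone: a “YYYY-MM” string has no day component, so as.Date() cannot parse it until one is appended.

```r
# "YYYY-MM" alone fails to parse: there is no day for %d to match
as.Date("2024-01", format = "%Y-%m-%d")   # NA

# Appending "-01" yields a complete, parseable date
converted <- as.Date(paste0("2024-01", "-01"), format = "%Y-%m-%d")
format(converted)                          # "2024-01-01"
```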
library(dplyr)
library(lubridate)
library(gt)
# Prepare data
crime_by_month <- crime_cases %>%
mutate(month = floor_date(date, "month")) %>%
group_by(month) %>%
summarise(total_crimes = n(), .groups = "drop")
# Create styled table
crime_by_month %>%
head(12) %>%
gt() %>%
tab_header(
title = "Monthly Crime Summary"
) %>%
cols_label(
month = "Month",
total_crimes = "Total Crimes"
) %>%
fmt_number(
columns = total_crimes,
decimals = 0,
sep_mark = ","
) %>%
tab_options(
table.border.top.color = "black",
table.border.bottom.color = "black",
table.border.top.width = px(2),
table.border.bottom.width = px(2),
heading.title.font.size = 16,
heading.title.font.weight = "bold"
)
Monthly Crime Summary

| Month | Total Crimes |
|---|---|
| 2024-01-01 | 529 |
| 2024-02-01 | 546 |
| 2024-03-01 | 502 |
| 2024-04-01 | 471 |
| 2024-05-01 | 568 |
| 2024-06-01 | 490 |
| 2024-07-01 | 608 |
| 2024-08-01 | 533 |
| 2024-09-01 | 519 |
| 2024-10-01 | 537 |
| 2024-11-01 | 509 |
| 2024-12-01 | 492 |
The date column was formatted, and crime counts were aggregated monthly using R’s group_by() and summarise() functions. This produced a summary table showing how many crimes occurred in each month of 2024. A few notable values:
January 2024: 529 crimes
February 2024: 546 crimes
March 2024: 502 crimes
April 2024: 471 crimes
ggplot(crime_by_month, aes(x = month, y = total_crimes)) +
geom_point(color = "red", size = 3) +
geom_line(color = "steelblue", linewidth = 1) +
geom_smooth(method = "loess", se = FALSE, color = "darkorange", linetype = "dashed") +
labs(title = "Monthly Crime Trend (with Smoothing)",
x = "Month", y = "Total Crimes") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
A line graph presents the monthly totals more visually, with the months January to December on the x-axis and the count of recorded crimes on the y-axis. It is immediately apparent that July and May were the peak months, with 608 and 568 incidents respectively, while April (471) and December (492) had the lowest counts, each under 500. A time-series line graph of this kind is extremely useful for revealing cyclical or seasonal patterns, which helps law enforcement agencies forecast and plan for anticipated high volumes in advance.
# Load the caret package if needed
library(caret)
# One-hot encode 'category' and 'location_type'
crime_detected <- crime_cases %>%
select(category, location_type)
crime_transformed <- dummyVars(" ~ .", data = crime_detected) %>%
predict(newdata = crime_detected) %>%
as.data.frame()
library(reshape2)
# Compute and plot the correlation matrix
correlation_values <- cor(crime_transformed)
ggplot(melt(correlation_values), aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 2) +
theme_minimal() +
labs(title = "Correlation Matrix of Crime Categories",
x = "", y = "") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The correlation matrix was rendered as a heatmap with the coefficient in each cell, values ranging between -1 and +1. A value near +1 indicates a strong positive correlation, here meaning two crime types that tend to co-occur. For example, “robbery” and “theft from person” showed a moderate positive correlation, suggesting that the environmental factors under which both crime types occur are roughly similar. Pairs such as “vehicle crime” and “anti-social behaviour” showed little to no correlation, indicating they are more independent in nature. Overall, the correlation matrix provided valuable insight into co-occurrence, which may point to shared root causes or environmental triggers governing these high-frequency crime categories.
ggplot(crime_cases, aes(x = long, y = lat)) +
geom_point(alpha = 0.3, color = "darkgreen") +
labs(title = "Scatter Plot of Crime Locations",
x = "Longitude", y = "Latitude") +
theme_minimal()
A geographic analysis of crime was carried out by plotting the longitude and latitude of each incident as a scatter plot. The map-like illustration comprises more than 6,300 crime points, showing the spatial distribution of crime across Colchester. The most significant clustering occurs in central Colchester, and even within this area there were pockets of especially dense clustering, particularly around High Street, Queen Street, and the area around East Hill. These dense pockets mark the urban and commercial centre; because such areas see more activity, they may experience more crime than the rest of the town. This kind of study is very useful for city planners and law agencies when identifying high-priority areas for infrastructure improvements or additional patrols.
library(leaflet)
leaflet(crime_cases) %>%
addTiles() %>%
addCircleMarkers(~long, ~lat,
radius = 1,
color = "red",
popup = ~category)
The digital map of Colchester offered clickable points for each crime, which identified crime type, location, and at times, outcome status. Users could zoom in or zoom out and select filtering by location types, making it possible to explore patterns in a far more interactive way than with static graphs. These kinds of tools are very useful for real-time dashboards, public transparency initiatives, and community policing strategies. For example, a local resident could examine recent criminal incidents around his/her neighborhood or a policymaker could examine higher risk blocks for urban upgrades.
library(ggplot2)
library(dplyr)
# Assign seasons to months
crime_cases$month <- month(crime_cases$date)
crime_cases$season <- case_when(
crime_cases$month %in% c(12, 1, 2) ~ "Winter",
crime_cases$month %in% c(3, 4, 5) ~ "Spring",
crime_cases$month %in% c(6, 7, 8) ~ "Summer",
TRUE ~ "Autumn"
)
# Crime count per season
crime_by_season <- crime_cases %>%
group_by(season) %>%
summarise(total_crimes = n())
# Plot pie chart with total number of crimes per season
ggplot(crime_by_season, aes(x = "", y = total_crimes, fill = season)) +
geom_bar(stat = "identity", width = 1, colour = "white", linewidth = 1, show.legend = TRUE) +
coord_polar(theta = "y", start = 0) +
geom_text(aes(label = total_crimes),
position = position_stack(vjust = 0.5), color = "white", size = 3.5) +
labs(title = "Total Crimes Across Seasons") +
theme_void()
This analysis asks whether crime in Colchester varies seasonally. The data were assigned to meteorological seasons: Winter (Dec–Feb), Spring (Mar–May), Summer (Jun–Aug), and Autumn (Sep–Nov). A pie chart presents the total crimes in each season proportionally, favouring simplicity and readability.
According to the results:
Summer recorded the highest crime count, with 1,636 incidents, accounting for approximately 26% of all reported crimes.
Spring followed closely with 1,607 crimes.
Autumn accounted for 1,570 crimes, while
Winter had the lowest with 1,491 incidents.
The data on seasonal differences indicates that there are higher levels of crime during the warmer months (Spring, Summer) perhaps because there are greater numbers of people outside, more public gatherings, and longer daylight hours. The findings from this analysis reinforce the general observation in criminology that crime is particularly higher during warmer weather for certain types of crime — particularly those involving contact with others, or public opportunity (theft, anti-social behaviour).
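The percentages quoted above follow directly from the seasonal counts; a base-R sketch using the totals reported above:

```r
# Seasonal crime counts as reported above
season_counts <- c(Winter = 1491, Spring = 1607, Summer = 1636, Autumn = 1570)

# Each season's share of all reported crimes, as a percentage
season_share <- round(100 * season_counts / sum(season_counts), 1)
season_share  # Summer accounts for about 26% of the 6,304 crimes
```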
# Group by season and category
seasonal_crime_stats <- crime_cases %>%
group_by(season, category) %>%
summarise(crime_count = n(), .groups = "drop")
# Heatmap by season and crime type
ggplot(seasonal_crime_stats, aes(x = season, y = category, fill = crime_count)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "lightyellow", high = "darkred") +
labs(title = "Crime Category Variation Across Seasons",
x = "Season", y = "Crime Category") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
A heatmap displays crime categories against seasons, with color intensity based on the number of incidents. This format gives an at-a-glance view of which crime types peak in which season.
Some key observations from this heatmap include:
Anti-social behaviour showed a sharp increase in Summer, reaching its peak during the warmer months.
Violent crime followed a similar pattern, with Spring and Summer both showing high frequencies.
Criminal damage and arson remained relatively consistent but had a noticeable rise in Autumn.
Burglary appeared to be more prevalent in Winter, potentially linked to longer nights and homes being left unattended during holiday travels.
The benefit of such a breakdown is its strategic value. Police services, for example, could strengthen patrols against anti-social behaviour in parks and public squares during Summer and prepare for the rise in burglaries around the Winter holidays. It would also support their predictive models and help target public awareness campaigns at seasonal risks.
# Load necessary packages
library(readr)
library(dplyr)
library(lubridate)
# Load the temperature dataset
temp_records <- read_csv("temp24.csv")
## Rows: 366 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): WindkmhDir
## dbl (15): station_ID, TemperatureCAvg, TemperatureCMax, TemperatureCMin, Td...
## lgl (1): PreselevHp
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View structure and summary
summary(temp_records)
## station_ID Date TemperatureCAvg TemperatureCMax
## Min. :3590 Min. :2024-01-01 Min. :-2.60 Min. : 1.10
## 1st Qu.:3590 1st Qu.:2024-04-01 1st Qu.: 7.00 1st Qu.:10.72
## Median :3590 Median :2024-07-01 Median :10.95 Median :14.75
## Mean :3590 Mean :2024-07-01 Mean :10.98 Mean :15.08
## 3rd Qu.:3590 3rd Qu.:2024-09-30 3rd Qu.:14.50 3rd Qu.:19.60
## Max. :3590 Max. :2024-12-31 Max. :23.10 Max. :29.80
##
## TemperatureCMin TdAvgC HrAvg WindkmhDir
## Min. :-6.100 Min. :-6.000 Min. :59.60 Length:366
## 1st Qu.: 3.325 1st Qu.: 4.725 1st Qu.:75.90 Class :character
## Median : 6.800 Median : 8.200 Median :82.75 Mode :character
## Mean : 6.486 Mean : 7.752 Mean :81.74
## 3rd Qu.: 9.500 3rd Qu.:11.000 3rd Qu.:88.80
## Max. :16.700 Max. :16.900 Max. :98.60
##
## WindkmhInt WindkmhGust PresslevHp Precmm
## Min. : 3.90 Min. : 11.10 Min. : 978.9 Min. : 0.000
## 1st Qu.:12.22 1st Qu.: 31.50 1st Qu.:1007.5 1st Qu.: 0.000
## Median :15.80 Median : 38.90 Median :1013.8 Median : 0.200
## Mean :16.52 Mean : 40.81 Mean :1013.7 Mean : 1.864
## 3rd Qu.:19.80 3rd Qu.: 48.20 3rd Qu.:1021.0 3rd Qu.: 1.600
## Max. :42.50 Max. :105.60 Max. :1037.3 Max. :38.000
## NA's :24
## TotClOct lowClOct SunD1h VisKm
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.10
## 1st Qu.:3.800 1st Qu.:5.800 1st Qu.: 0.325 1st Qu.:20.73
## Median :5.600 Median :6.900 Median : 3.500 Median :30.95
## Mean :5.304 Mean :6.609 Mean : 4.203 Mean :31.42
## 3rd Qu.:7.200 3rd Qu.:7.600 3rd Qu.: 7.100 3rd Qu.:41.20
## Max. :8.000 Max. :8.000 Max. :15.600 Max. :71.20
## NA's :5
## SnowDepcm PreselevHp
## Min. :1.00 Mode:logical
## 1st Qu.:1.25 NA's:366
## Median :1.50
## Mean :1.50
## 3rd Qu.:1.75
## Max. :2.00
## NA's :364
# Convert 'Date' column to proper Date format
temp_records$Date <- as.Date(temp_records$Date, format = "%Y-%m-%d")
# Print column names and check for missing values
colnames(temp_records)
## [1] "station_ID" "Date" "TemperatureCAvg" "TemperatureCMax"
## [5] "TemperatureCMin" "TdAvgC" "HrAvg" "WindkmhDir"
## [9] "WindkmhInt" "WindkmhGust" "PresslevHp" "Precmm"
## [13] "TotClOct" "lowClOct" "SunD1h" "VisKm"
## [17] "SnowDepcm" "PreselevHp"
colSums(is.na(temp_records))
## station_ID Date TemperatureCAvg TemperatureCMax TemperatureCMin
## 0 0 0 0 0
## TdAvgC HrAvg WindkmhDir WindkmhInt WindkmhGust
## 0 0 0 0 0
## PresslevHp Precmm TotClOct lowClOct SunD1h
## 0 24 0 5 0
## VisKm SnowDepcm PreselevHp
## 0 364 366
# Remove columns with more than 90% missing values
temp_records <- temp_records %>% select(where(~ mean(is.na(.)) <= 0.9))
# View cleaned column names
colnames(temp_records)
## [1] "station_ID" "Date" "TemperatureCAvg" "TemperatureCMax"
## [5] "TemperatureCMin" "TdAvgC" "HrAvg" "WindkmhDir"
## [9] "WindkmhInt" "WindkmhGust" "PresslevHp" "Precmm"
## [13] "TotClOct" "lowClOct" "SunD1h" "VisKm"
The 2024 temperature dataset for Colchester comprised 366 records (including the leap day) and 18 columns of daily meteorological data: average, maximum, and minimum temperature; relative humidity; precipitation; wind speed; sunshine hours; visibility; and others. The first cleaning step was to drop columns with more than 90% missing values (for example, SnowDepcm and PreselevHp), which were effectively useless as they contained almost no data. The remaining columns, with little or no missing data, were kept for analysis. This stage was important in guaranteeing the quality and reliability needed for subsequent summaries and visuals.
# Define mode function
most_common_value <- function(x) {
ux <- na.omit(unique(x))
ux[which.max(tabulate(match(x, ux)))]
}
# Replace NAs in character columns with mode
temp_records <- temp_records %>%
mutate(across(where(is.character), ~ ifelse(is.na(.), most_common_value(.), .)))
# Extract month name for grouping
temp_records$Month <- month(temp_records$Date, label = TRUE, abbr = FALSE)
# Summarise by month
weather_by_month <- temp_records %>%
group_by(Month) %>%
summarise(
AvgTemp = mean(TemperatureCAvg, na.rm = TRUE),
MaxTemp = mean(TemperatureCMax, na.rm = TRUE),
MinTemp = mean(TemperatureCMin, na.rm = TRUE),
Humidity = mean(HrAvg, na.rm = TRUE),
Precipitation = mean(Precmm, na.rm = TRUE),
WindSpeed = mean(WindkmhInt, na.rm = TRUE),
Sunshine = mean(SunD1h, na.rm = TRUE),
Visibility = mean(VisKm, na.rm = TRUE),
.groups = 'drop'
)
Once cleaned, the dataset was grouped by month to calculate average values for key weather indicators. This monthly aggregation revealed trends such as:
July had the highest average temperature (~17.9°C) and maximum sunshine hours.
January recorded the lowest average temperature (~5.3°C).
October and November saw the highest rainfall.
Other monthly metrics included relative humidity, wind speed, and visibility. This summary underpins all monthly comparisons and establishes a detailed climatic profile for each month that can be related to the crime data.
library(ggplot2)
library(forcats)
library(dplyr)
library(tidyr)
# Ensure Month is a factor in correct order
weather_by_month$Month <- factor(weather_by_month$Month, levels = month.name)
# Convert data to long format for easier plotting
weather_long <- weather_by_month %>%
pivot_longer(cols = c(AvgTemp, MaxTemp, MinTemp),
names_to = "TemperatureType",
values_to = "Temperature")
# Create the line plot
ggplot(weather_long, aes(x = Month, y = Temperature, color = TemperatureType, group = TemperatureType)) +
geom_line(linewidth = 1.2) +
labs(title = "Monthly Temperature Trends",
x = "Month", y = "Temperature (°C)", color = "Temperature Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
As expected, temperatures rose from January through July, peaked during summer, and declined through autumn and winter. July had the highest average and maximum temperatures, while February recorded some of the lowest minimums.
library(ggridges)
ggplot(temp_records, aes(x = HrAvg, y = Month, fill = Month)) +
geom_density_ridges(scale = 1, alpha = 0.7, color = "white") +
labs(title = "Humidity Distribution per Month",
x = "Relative Humidity (%)", y = "Month") +
theme_minimal() +
theme(legend.position = "none")
## Picking joint bandwidth of 2.84
The ridgeline plot shows the humidity distribution for each of the twelve months. Humidity (HrAvg) had a relatively tight, consistent distribution across all months; by contrast, variables such as wind speed (WindkmhInt) and visibility (VisKm) exhibited much more variability, particularly during the transitional months of April and October. Such distributional views help identify the months with the most volatile atmospheric conditions and, in turn, potential changes in behaviour or movement.
library(reshape2)
# Select numeric weather variables
weather_correlation_data <- temp_records %>%
select(TemperatureCAvg, TemperatureCMax, TemperatureCMin,
HrAvg, Precmm, WindkmhInt, VisKm, SunD1h)
# Compute and plot correlation matrix
correlation_values <- cor(weather_correlation_data, use = "complete.obs")
# Plot correlation heatmap with correlation values on the tiles
ggplot(melt(correlation_values), aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
labs(title = "Correlation Between Weather Variables", x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
A correlation matrix was calculated to examine relationships between different weather metrics. Strong positive correlations were found between:
Temperature variables (average, max, min)
Sunshine and temperature
Negative correlations appeared between:
Humidity and sunshine
Precipitation and visibility
This is useful for understanding broader climate dynamics. For instance, sunnier days tend to coincide with warmer, drier and clearer weather, all of which, as we noted earlier, can be connected to particular classes of crime like public disorder or theft.
library(plotly)
# Calculate temperature range per month
weather_by_month$temp_range <- weather_by_month$MaxTemp - weather_by_month$MinTemp
# Create frequency table
temperature_frequency <- table(cut(weather_by_month$temp_range, breaks = 5))
temp_analysis_df <- data.frame(Temperature_Range = names(temperature_frequency),
Frequency = as.vector(temperature_frequency))
# Create pie chart
plot_ly(data = temp_analysis_df, labels = ~Temperature_Range, values = ~Frequency, type = "pie") %>%
layout(title = "Temperature Range Distribution")
To show the distribution of monthly temperature ranges, a pie chart was constructed. The range was calculated by subtracting the monthly minimum from the maximum temperature and grouped into 5 bins:
(5.64°C–7.01°C): 5 months
(7.01°C–8.37°C): 1 month
(8.37°C–9.73°C): 1 month
(9.73°C–11.1°C): 3 months
(11.1°C–12.5°C): 2 months
The graph shows that most months contained only moderate daily temperature differences, with only a few months displaying wide temperature changes (greater than 10°C). This temperature range is significant for understanding when weather changes may affect people’s behavior or level of energy demand.
# Ensure date format matches in both datasets
crime_cases$date <- as.Date(crime_cases$date)
temp_records$Date <- as.Date(temp_records$Date)
# Merge crime and weather data by date
joined_data <- merge(crime_cases, temp_records, by.x = "date", by.y = "Date")
# View structure of combined data
summary(joined_data)
## date category persistent_id lat
## Min. :2024-01-01 Length:6304 Length:6304 Min. :51.88
## 1st Qu.:2024-03-01 Class :character Class :character 1st Qu.:51.89
## Median :2024-07-01 Mode :character Mode :character Median :51.89
## Mean :2024-06-15 Mean :51.89
## 3rd Qu.:2024-09-01 3rd Qu.:51.89
## Max. :2024-12-01 Max. :51.90
##
## long street_id street_name id
## Min. :0.8788 Min. :2152686 Length:6304 Min. :115954844
## 1st Qu.:0.8966 1st Qu.:2153025 Class :character 1st Qu.:118009952
## Median :0.9013 Median :2153155 Mode :character Median :120228058
## Mean :0.9029 Mean :2153873 Mean :120403000
## 3rd Qu.:0.9088 3rd Qu.:2153366 3rd Qu.:122339060
## Max. :0.9246 Max. :2343256 Max. :125550731
##
## location_type outcome_status month season
## Length:6304 Length:6304 Min. : 1.000 Length:6304
## Class :character Class :character 1st Qu.: 3.000 Class :character
## Mode :character Mode :character Median : 7.000 Mode :character
## Mean : 6.481
## 3rd Qu.: 9.000
## Max. :12.000
##
## station_ID TemperatureCAvg TemperatureCMax TemperatureCMin
## Min. :3590 Min. : 7.00 Min. :10.60 Min. : 2.500
## 1st Qu.:3590 1st Qu.: 7.20 1st Qu.:10.90 1st Qu.: 5.400
## Median :3590 Median :11.50 Median :14.70 Median : 8.100
## Mean :3590 Mean :11.67 Mean :15.36 Mean : 8.278
## 3rd Qu.:3590 3rd Qu.:14.50 3rd Qu.:19.30 3rd Qu.:11.700
## Max. :3590 Max. :19.90 Max. :25.70 Max. :15.000
##
## TdAvgC HrAvg WindkmhDir WindkmhInt
## Min. : 3.600 Min. :66.90 Length:6304 Min. : 6.90
## 1st Qu.: 5.300 1st Qu.:77.80 Class :character 1st Qu.:14.20
## Median : 9.700 Median :83.60 Mode :character Median :15.50
## Mean : 8.927 Mean :84.67 Mean :18.51
## 3rd Qu.:12.200 3rd Qu.:92.30 3rd Qu.:24.00
## Max. :13.000 Max. :95.70 Max. :28.90
##
## WindkmhGust PresslevHp Precmm TotClOct
## Min. :20.40 Min. : 990.2 Min. : 0.000 Min. :2.500
## 1st Qu.:37.10 1st Qu.:1001.3 1st Qu.: 0.000 1st Qu.:6.500
## Median :38.90 Median :1014.2 Median : 0.400 Median :7.000
## Mean :43.13 Mean :1011.8 Mean : 1.898 Mean :6.577
## 3rd Qu.:51.90 3rd Qu.:1021.0 3rd Qu.: 3.000 3rd Qu.:7.900
## Max. :61.20 Max. :1027.1 Max. :11.000 Max. :8.000
## NA's :537
## lowClOct SunD1h VisKm Month
## Min. :5.500 Min. : 0.000 Min. : 7.30 July : 608
## 1st Qu.:7.300 1st Qu.: 0.000 1st Qu.:13.90 May : 568
## Median :7.700 Median : 0.600 Median :26.40 February: 546
## Mean :7.433 Mean : 2.493 Mean :26.33 October : 537
## 3rd Qu.:7.900 3rd Qu.: 5.000 3rd Qu.:42.50 August : 533
## Max. :8.000 Max. :11.600 Max. :48.20 January : 529
## (Other) :2983
To analyze crime in relation to weather, the crime data and daily weather data were combined. Before merging, the date columns in both files were converted to a common format using R’s as.Date() function. Once the dates were aligned, the two datasets were merged on the date field, producing a combined data frame of 6,304 observations containing both crime and weather variables. The merged data frame includes weather variables such as temperature, precipitation (Precmm), relative humidity (HrAvg) and total cloud cover (TotClOct), alongside the category, location and outcome status of each crime. This opens up a rich line of analysis: daily weather conditions can now be linked directly to criminal activity and examined spatially.
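One caveat with base merge() is that it performs an inner join by default, silently dropping crime records whose date has no weather match. A sketch of a pre-merge check using dplyr's anti_join(), with toy frames standing in for crime_cases and temp_records:

```r
library(dplyr)

# Toy stand-ins for crime_cases and temp_records; one crime date
# (2024-01-03) deliberately has no weather row
crimes <- data.frame(date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03")))
weather <- data.frame(Date = as.Date(c("2024-01-01", "2024-01-02")),
                      TemperatureCAvg = c(7.0, 8.5))

# Rows in `crimes` with no matching weather record; these would be
# silently lost by an inner merge on date
unmatched <- anti_join(crimes, weather, by = c("date" = "Date"))
nrow(unmatched)  # 1 here: the unmatched 2024-01-03 crime
```

Running the same check on the real datasets before merging would confirm that all 6,304 crime records survived the join.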
# Bin average temperature into categories
joined_data$temp_group <- cut(joined_data$TemperatureCAvg,
                              breaks = c(-Inf, 5, 10, 15, 20, Inf),
                              labels = c("Very Cold", "Cold", "Mild", "Warm", "Hot"))
# Count crimes per temperature group
crime_temp_relation <- joined_data %>%
  group_by(temp_group) %>%
  summarise(total_crimes = n())
# Bar plot
ggplot(crime_temp_relation, aes(x = temp_group, y = total_crimes, fill = temp_group)) +
  geom_bar(stat = "identity") +
  labs(title = "Crimes by Temperature Group",
       x = "Temperature Group", y = "Number of Crimes") +
  theme_minimal()
The frequency of crimes, plotted as a bar chart in Figure 6, indicates that crime peaked on “Warm” days, followed by “Hot” days, with fewer crimes on days classified as “Very Cold” or “Cold”. These findings align with criminological behavioural theories: warm weather encourages social interaction and outdoor leisure, creating opportunities for both interpersonal and opportunistic crime.
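A caveat worth checking: raw counts per temperature group can mislead if one group simply spans more calendar days than another. A sketch of a per-day rate, with toy rows standing in for the joined_data built above:

```r
library(dplyr)

# Toy stand-in for joined_data: one row per crime, with its date and
# temperature group
toy <- data.frame(
  date = as.Date(c("2024-06-01", "2024-06-01", "2024-06-02", "2024-01-05")),
  temp_group = c("Warm", "Warm", "Warm", "Cold"))

# Crimes per day within each group: divide total crimes by the number of
# distinct days that fell into that group
rate <- toy %>%
  group_by(temp_group) %>%
  summarise(total_crimes = n(),
            n_days = n_distinct(date),
            crimes_per_day = total_crimes / n_days,
            .groups = "drop")
rate
```

If "Warm" days still show the highest crimes-per-day rate after this normalisation, the behavioural interpretation above is on firmer ground.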
# Bin precipitation levels
joined_data$rain_group <- cut(joined_data$Precmm,
                              breaks = c(-Inf, 0, 2, 5, 10, Inf),
                              labels = c("No Rain", "Light", "Moderate", "Heavy", "Very Heavy"))
rain_crime_relation <- joined_data %>%
  group_by(rain_group) %>%
  summarise(total_crimes = n())
ggplot(rain_crime_relation, aes(x = rain_group, y = total_crimes, fill = rain_group)) +
  geom_bar(stat = "identity") +
  labs(title = "Crime by Rainfall Intensity",
       x = "Rainfall Level", y = "Number of Crimes") +
  theme_minimal()
Rainfall was grouped into five intensity levels: No Rain, Light (0–2mm), Moderate (2–5mm), Heavy (5–10mm) and Very Heavy (>10mm). A bar chart comparing these groups shows that dry days had the most crime incidents, with fewer incidents as rainfall increased; days with very heavy rain recorded the fewest crimes. This inverse relationship between rainfall and crime suggests that bad weather keeps people indoors, reducing the opportunity for offences such as street theft, vandalism or assault.
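One way to put a number on this inverse relationship is a rank correlation between daily rainfall and daily crime counts. A sketch with toy rows standing in for joined_data:

```r
library(dplyr)

# Toy stand-in for joined_data: one row per crime; rainier days have fewer rows
toy <- data.frame(
  date = as.Date("2024-01-01") + c(0, 0, 0, 1, 1, 2),
  Precmm = c(0, 0, 0, 2.5, 2.5, 8.0))

# Aggregate to one row per day, then correlate rainfall with crime count;
# Spearman is used since neither variable need be normally distributed
daily <- toy %>%
  group_by(date) %>%
  summarise(crimes = n(), rain = first(Precmm), .groups = "drop")
rho <- cor(daily$rain, daily$crimes, method = "spearman")
rho  # -1 on this toy data: crime counts fall as rainfall rises
```

Applied to the full year of joined_data, a clearly negative rho would quantify the deterrent effect the bar chart suggests.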
# Compare solved vs unsolved crimes by temperature
solved_crimes_by_temp <- joined_data %>%
  mutate(solved = ifelse(outcome_status == "Investigation complete; no suspect identified", "Unsolved", "Solved")) %>%
  group_by(solved) %>%
  summarise(avg_temp = mean(TemperatureCAvg, na.rm = TRUE))
ggplot(solved_crimes_by_temp, aes(x = solved, y = avg_temp, fill = solved)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Temperature for Solved vs Unsolved Crimes",
       x = "Crime Solved Status", y = "Average Temperature (°C)") +
  theme_minimal()
This analysis compared the average temperatures for solved vs. unsolved crimes. It revealed that:
Solved crimes occurred at an average of 11.7°C
Unsolved crimes occurred at a warmer average of 13.9°C
This small yet notable difference was illustrated in a bar chart. It implies that offences committed on cooler days were more likely to be solved, perhaps because conditions such as smaller crowds or better visibility under clearer skies contributed to stronger witness accounts or forensic evidence.
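A Welch two-sample t-test is one way to check whether a gap of this size is larger than chance. A sketch with toy temperature vectors standing in for the per-crime values in joined_data:

```r
# Toy per-crime average temperatures for solved and unsolved offences
# (illustrative values centred near the 11.7 and 13.9 degC means reported above)
solved_temps <- c(10.2, 11.5, 12.1, 11.9, 12.8)
unsolved_temps <- c(13.1, 14.2, 13.8, 14.5, 13.9)

# Welch's t-test does not assume equal variances between the two groups
result <- t.test(solved_temps, unsolved_temps)
result$p.value < 0.05  # TRUE would indicate a statistically significant gap
```

On the real data, splitting TemperatureCAvg by the solved/unsolved flag defined in the chunk above and passing the two vectors to t.test() would show whether the 2.2°C difference survives a formal test.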
# Define weather conditions
joined_data$weather_type <- case_when(
  joined_data$TemperatureCAvg < 10 & joined_data$Precmm > 1 ~ "Cold & Wet",
  joined_data$TemperatureCAvg < 10 & joined_data$Precmm <= 1 ~ "Cold & Dry",
  joined_data$TemperatureCAvg >= 10 & joined_data$Precmm > 1 ~ "Warm & Wet",
  TRUE ~ "Warm & Dry"
)
grouped_crime_weather_data <- joined_data %>%
  group_by(weather_type, category) %>%
  summarise(count = n(), .groups = "drop")
ggplot(grouped_crime_weather_data, aes(x = weather_type, y = count, fill = category)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Crime Types in Different Weather Conditions",
       x = "Weather Type", y = "Crime Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The results show that “Warm & Dry” conditions accounted for the largest number of crimes, particularly anti-social behaviour, public disorder and violence, while burglaries and vehicle crime were slightly higher on “Cold & Wet” days. These patterns can support weather-aware crime prevention, such as increased monitoring of public spaces on warm, dry days and heightened burglary vigilance during colder months.
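Because the four weather types contain different numbers of crimes overall, within-group proportions make category shifts easier to compare than raw stacked counts. A sketch with toy rows standing in for joined_data:

```r
library(dplyr)

# Toy stand-in for joined_data: one row per crime with its weather type
toy <- data.frame(
  weather_type = c("Warm & Dry", "Warm & Dry", "Warm & Dry", "Cold & Wet"),
  category = c("violent-crime", "violent-crime", "burglary", "burglary"))

# Share of each crime category within each weather type
props <- toy %>%
  count(weather_type, category) %>%
  group_by(weather_type) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
props
```

Plotting `prop` instead of `count` (e.g. with position = "fill" in geom_bar) would reveal whether burglary's share genuinely rises on Cold & Wet days or merely rides on the overall volume.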
# Select three-way relationship for a few top crime categories
crime_rankings <- joined_data %>%
  count(category, sort = TRUE) %>%
  top_n(5) %>%
  pull(category)
## Selecting by n
interaction_results <- joined_data %>%
  filter(category %in% crime_rankings)
ggplot(interaction_results, aes(x = TemperatureCAvg, y = VisKm, color = category)) +
  geom_point(alpha = 0.5) +
  labs(title = "Temperature vs Visibility by Crime Type",
       x = "Average Temperature (°C)", y = "Visibility (km)") +
  theme_minimal()
The visualization shows that:
Anti-social behaviour and violent crime incidents clustered at moderate to high temperatures and visibility.
Vehicle crime occurred across a wider range of visibility values, but clustered around mild temperatures.
library(dplyr)
library(ggplot2)
# Categorize using temperature and humidity
joined_data <- joined_data %>%
  mutate(climate_group = case_when(
    TemperatureCAvg >= 15 & HrAvg < 60 ~ "Hot & Dry",
    TemperatureCAvg >= 15 & HrAvg >= 60 ~ "Hot & Humid",
    TemperatureCAvg < 15 & HrAvg < 60 ~ "Cold & Dry",
    TRUE ~ "Cold & Humid"
  ))
# Summarize crime counts by climate group
climate_crime_summary <- joined_data %>%
  group_by(climate_group, category) %>%
  summarise(count = n(), .groups = "drop")
# Plot
ggplot(climate_crime_summary, aes(x = climate_group, y = count, fill = category)) +
  geom_bar(stat = "identity") +
  labs(title = "Crime by Combined Weather Conditions",
       x = "Climate Group", y = "Crime Count") +
  theme_minimal()
The bar chart indicates higher rates of violence and public disorder under “Hot & Humid” and “Hot & Dry” conditions, whereas burglaries and criminal damage were more common in “Cold & Humid” weather. This suggests that crime prevention strategies should consider the full climate profile of a day rather than a single generalized weather class.
# Group crime and cloud data by month
cloud_impact_on_crime <- joined_data %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(total_crimes = n(),
            avg_cloud = mean(TotClOct, na.rm = TRUE))
ggplot(cloud_impact_on_crime, aes(x = month)) +
  geom_line(aes(y = total_crimes, color = "Total Crimes"), size = 1.2) +
  geom_line(aes(y = avg_cloud * 10, color = "Cloud Cover (scaled)"), linetype = "dashed") +
  scale_y_continuous(name = "Total Crimes",
                     sec.axis = sec_axis(~./10, name = "Cloud Cover (Octas)")) +
  labs(title = "Crime Trend vs Cloud Cover Over Time",
       x = "Month", color = "") +
  theme_minimal()
The line chart shows a weak positive correlation between monthly cloud cover and crime: more crimes occurred in cloudier months, particularly during winter. This may be tied to diminished natural light, which can lower public vigilance and provide more opportunities for theft. The trend further supports the idea that atmospheric conditions influence human behaviour.
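The "weak positive correlation" read off the chart could be quantified with a Pearson coefficient on the monthly aggregates. A sketch with toy monthly figures standing in for the cloud_impact_on_crime table computed above:

```r
# Toy monthly totals and mean cloud cover (illustrative values chosen so
# that crime rises with cloud cover; not the report's actual figures)
total_crimes <- c(505, 520, 540, 560, 575, 610)
avg_cloud <- c(5.5, 6.0, 6.4, 6.9, 7.1, 7.8)

# Pearson correlation between monthly crime counts and mean cloud cover
r <- cor(total_crimes, avg_cloud)
r
```

On the real table this would be `with(cloud_impact_on_crime, cor(total_crimes, avg_cloud))`; a value near zero would caution against over-reading the visual trend, since only twelve monthly points are available.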
This report examined crime trends in Colchester during 2024, integrating over 6,300 crime records with detailed weather data. The analysis revealed clear spatial and seasonal patterns—crime rates were higher in central areas like High Street, East Hill, and Queen Street, and more frequent during warmer months.
Weather conditions played a notable role: warm, dry days saw increased incidents of anti-social behaviour and assault, while cold, rainy days correlated with lower crime. Crimes were also less likely to be solved on warmer days, possibly due to environmental factors like crowd density.
These findings suggest benefits for context-aware policing—such as increasing patrols during warm evenings—and urban planning strategies that factor in environmental conditions, like improved lighting and seasonal crime alerts.
Finally, further research incorporating variables like public holidays, events, and real-time weather could enhance crime forecasting and prevention strategies.
UK Police Data (Crime24 Dataset) Tierney, N. (n.d.). UK Police Crime Data. Retrieved from https://ukpolice.njtierney.com/reference/ukp_crime.html
Climate Data (Temp24 Dataset) Czernecki, B. (n.d.). Ogimet Meteorological Data. Retrieved from https://bczernecki.github.io/climate/reference/meteo_ogimet.html